Exivity Kubernetes best practices
This document describes the recommended Kubernetes deployment patterns for Exivity in self-managed, on-premises environments. It is intended as a prescriptive starting point for implementation teams that need a default architecture for single-node, multi-node, and multi-site deployments.
The recommendations below assume a Linux Kubernetes cluster, Helm-based deployment using the Exivity chart, and self-managed infrastructure components such as ingress, storage, PostgreSQL, and RabbitMQ.
To run on Kubernetes, Exivity relies on third-party infrastructure and middleware: the Kubernetes distribution itself, PostgreSQL, RabbitMQ, ingress controllers such as NGINX Ingress Controller or Traefik, and storage platforms such as Longhorn or NFS-backed storage.
These products are third-party infrastructure that you operate. Exivity documents how the application depends on them, but you are responsible for selecting, operating, securing, backing up, monitoring, and supporting those third-party products according to their vendor documentation and your internal platform standards.
Deployment scenarios
| Scenario | Recommended use | Preferred architecture |
|---|---|---|
| ☸️ Single-node Kubernetes | Small environments, evaluation, non-HA production where simplicity is preferred | One Kubernetes node, ingress/TLS, embedded or external PostgreSQL, embedded RabbitMQ, provisioner-backed local RWO storage (or RWX if already available) |
| ☸️ Multi-node Kubernetes | Production HA deployments within one site | Multi-node Kubernetes, Longhorn RWX storage, external PostgreSQL, site-local in-cluster RabbitMQ, ingress/load balancer |
| ☸️ Multi-site Kubernetes | Disaster recovery across sites | Active/passive sites with replicated PostgreSQL, independent RabbitMQ per site, independent storage per site, GitOps-controlled failover |
Common foundations
Use these foundations for all Kubernetes scenarios.
| Area | Recommendation |
|---|---|
| ☸️ Kubernetes | Use a CNCF-conformant Kubernetes distribution on Linux nodes with a stable CSI driver and production ingress controller. Known Exivity deployments run on upstream Kubernetes, Rancher (RKE2/K3s), and Red Hat OpenShift. Other CNCF-conformant distributions are likely to work but should be confirmed with Exivity support before production use. Lightweight learning distributions such as Minikube, Kind, and Docker Desktop are intended for development only. |
| ⎈ Helm | Use Helm 3 and maintain deployment values in version control. |
| 🗂️ Namespace | Deploy Exivity into a dedicated namespace, normally exivity. |
| 🚦 Ingress / load balancer | Use a production ingress controller such as NGINX or Traefik. Terminate TLS at ingress or at an upstream load balancer. |
| 🛡️ TLS | Use cert-manager, enterprise PKI, or an existing TLS secret. Do not expose production Exivity over plain HTTP. |
| 🔐 Secrets | Set secret.appKey and secret.jwtSecret explicitly for production. Do not rely on generated values. |
| 📦 Image registry | Mirror images to an internal registry for restricted or air-gapped sites. |
| 🔄 Backups | Back up PostgreSQL and Exivity shared data. Do not rely only on persistent volumes for recovery. |
| 📈 Monitoring | Enable Kubernetes, ingress, PostgreSQL, RabbitMQ, and storage monitoring. Enable the Exivity ServiceMonitor where Prometheus Operator is used. |
| 📄 Logs | Tune log retention with logfiles.deleteDays and logfiles.compressDays to match your retention and storage requirements. |
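As a rough illustration of the Secrets and Logs rows above, a values fragment might look like the following. The key paths are taken from the names used in this document (secret.appKey, secret.jwtSecret, logfiles.deleteDays, logfiles.compressDays). The file name and every value shown are assumptions for illustration, not chart defaults.

```yaml
# values-production.yaml (hypothetical file name, kept in version control)
secret:
  # Placeholders only: generate your own values and inject them from your
  # secret management tooling rather than committing them in plain text.
  appKey: "<replace-with-generated-app-key>"
  jwtSecret: "<replace-with-generated-jwt-secret>"

logfiles:
  # Illustrative retention, not chart defaults:
  # compress logs after 7 days, delete them after 30 days.
  compressDays: 7
  deleteDays: 30
```

Check the exact value types and defaults against the chart version you deploy before relying on this fragment.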
PVC sizing
The chart defaults are intentionally small and are usually not appropriate for production. Use the following as starting values and size upward for high-volume or multi-tenant environments.
| PVC group | Volume | Recommended size |
|---|---|---|
| 📚 Data | extracted | 50-100Gi |
| 📚 Data | exported | 50-100Gi |
| 📚 Data | import | 10-20Gi |
| 📚 Data | report | 10-20Gi |
| 📄 Logs | All service log PVCs | 5-10Gi each |
| ⚙️ Config | etl, griffon, chronos | 1Gi |
| 🐘 PostgreSQL | Embedded PostgreSQL or CloudNativePG instance volume | 25-50Gi |
For CSI-backed storage such as Longhorn, PVC sizes are enforced by the storage backend. For NFS-subdir provisioners, PVC sizes may be advisory only, but they should still be set to document intent and simplify future migration to CSI-backed storage. For local-path style provisioners on single-node deployments, PVC sizes are advisory and not reserved against node disk capacity, so always validate the sum of requested PVC sizes against the actual node disk size and monitor node disk free space.
extracted and exported typically grow fastest because they depend on data source volume, retention, and report frequency. Prefer 100Gi for larger environments or high-frequency reporting.
Scenario A: single-node Kubernetes
Single-node Kubernetes is suitable when HA is not required or when you want the smallest possible Kubernetes footprint. It is also useful for demos, evaluation, and small production environments with clear recovery expectations.
Architecture
This table describes the role of each layer in the diagram above. It is descriptive, not prescriptive.
| Layer | Role in this scenario |
|---|---|
| 👥 Users / API clients | Reach Exivity through DNS and the cluster ingress endpoint. |
| 🛡️ Ingress / TLS | Routes / to glass and API paths to proximity-api; terminates TLS. |
| ☸️ Kubernetes node | Single node hosting the control-plane, worker role, all Exivity services, RabbitMQ, PostgreSQL, and shared storage. |
| 🐘 PostgreSQL | Either embedded (in-cluster) or external; both stay in the same single-node footprint. |
| 🐇 RabbitMQ | Embedded in-cluster RabbitMQ used for transient communication. |
| 💾 Shared volumes | Hold logs, config, and pipeline data; mounted into the Exivity services running on this node. |
Configuration
This table lists the choices to make for a single-node deployment. It is prescriptive.
| Decision | Recommended value |
|---|---|
| ☸️ Kubernetes | One Linux node running both control-plane and worker roles. |
| 🐘 PostgreSQL | External PostgreSQL is preferred. Embedded PostgreSQL is acceptable for evaluation and small environments. |
| 🐇 RabbitMQ | Use site-local in-cluster RabbitMQ. The embedded chart dependency is acceptable for evaluation. For production, prefer the RabbitMQ Cluster Operator running a single RabbitMQ node, because the embedded chart relies on the unsupported bitnamilegacy image. See the RabbitMQ section for details. |
| 💾 Storage access mode | RWX is not required. Set storage.sharedVolumeAccessMode: ReadWriteOnce because every Exivity pod runs on the same node. |
| 💾 Storage class | Use a provisioner-backed local StorageClass. Validated examples include Docker Desktop's hostpath, K3s' built-in local-path, and local-path-provisioner. Do not point Exivity directly at unmanaged raw hostPath volumes; always go through a StorageClass/provisioner. NAS/NFS is a valid alternative when you already operate reliable NAS, want to decouple storage from the Kubernetes node, want easier node rebuild/replacement, or anticipate migrating to multi-node later. NAS/NFS does not make Exivity HA when Kubernetes itself is still single-node. Longhorn works on a single node but provides limited HA value there because replicas cannot be spread across nodes. |
| 🚦 Ingress / load balancer | Any CNCF-conformant Kubernetes ingress controller with TLS termination is supported. Proven options include Traefik, NGINX Ingress Controller, and HAProxy Ingress. Reach the cluster through a LoadBalancer service provided by your platform (cloud provider's native load balancer or an upstream hardware load balancer). On bare-metal Kubernetes without a cloud provider, an implementation such as MetalLB can fill that role; treat it as one option among hardware and software load balancers and confirm operational fit with your platform team. |
| 🔄 Backups | Back up PostgreSQL and shared data. Test the restore path before production handover. |
Local-path style provisioners do not track or reserve aggregate disk capacity across PVCs on the node. The sum of requested PVC sizes can exceed available disk without Kubernetes blocking it. Size all Exivity PVCs against the actual node disk capacity, leave headroom for PostgreSQL, logs, and image growth, and monitor node disk free space.
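A minimal sketch of the storage access mode decision for this scenario is shown below. Only the storage.sharedVolumeAccessMode key is taken from this document; the StorageClass selection is left out because its value key is not named here.

```yaml
# Single-node values fragment (sketch).
storage:
  # Every Exivity pod runs on the same node, so ReadWriteOnce is sufficient.
  sharedVolumeAccessMode: ReadWriteOnce
```

If the deployment later moves to multiple nodes, this value has to change to ReadWriteMany and the volumes need an RWX-capable StorageClass.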
Reference values: charts/exivity/examples/best-practice-single-node.yaml
Scenario B: multi-node Kubernetes, single site
Multi-node Kubernetes is the preferred architecture for production HA environments within one site. This is the default recommendation for larger deployments.
Architecture
This table describes the role of each layer in the diagram above. It is descriptive, not prescriptive.
| Layer | Role in this scenario |
|---|---|
| 👥 Users / API clients | Reach Exivity through an external load balancer in front of cluster ingress. |
| 🛡️ Ingress / TLS | Routes traffic to multiple stateless Exivity replicas; terminates TLS. |
| ☸️ Kubernetes worker nodes | Multiple nodes spread across failure domains; host the Exivity application tier and middleware. |
| 🧩 Application tier | Stateless services (frontend, API, backend) run with multiple replicas; workflow and ETL services run as singletons. |
| 🐘 PostgreSQL | External or in-cluster Kubernetes-native PostgreSQL serving the active workload from the same low-latency site. |
| 🐇 RabbitMQ | Site-local in-cluster RabbitMQ used for transient communication. |
| 💾 Shared storage | RWX-capable storage shared across nodes; holds logs, config, and pipeline data. |
Configuration
This table lists the choices to make for a multi-node single-site deployment. It is prescriptive.
| Decision | Recommended value |
|---|---|
| ☸️ Kubernetes | Use at least three worker nodes. For HA control-plane requirements, also use three control-plane nodes. |
| 📍 Node placement | Spread nodes across racks, chassis, failure domains, or availability zones where available. |
| 🐘 PostgreSQL | Use external PostgreSQL for production. For self-hosted Kubernetes PostgreSQL, use CloudNativePG. |
| 🐇 RabbitMQ | Run RabbitMQ site-local in-cluster. Use the RabbitMQ Cluster Operator for production because the embedded chart dependency relies on the unsupported bitnamilegacy image. External or managed RabbitMQ is optional when required by your platform standards. See the RabbitMQ section for details. |
| 💾 Storage access mode | RWX is required because Exivity pods run across multiple nodes. Keep storage.sharedVolumeAccessMode: ReadWriteMany. |
| 💾 Storage class | Prefer Longhorn with three replicas per volume. An HA NAS/NFS platform that exposes RWX is a valid alternative when Longhorn or an equivalent CSI RWX storage class is not available. Avoid using a simple in-cluster NFS server (for example, the NFS Ganesha server and external provisioner backed by a single PVC) as the HA default unless its backing storage and node placement are explicitly designed for HA. |
| 🚦 Load balancer | Use a hardware load balancer or your existing load balancing platform in front of ingress. On bare-metal Kubernetes without a cloud-provided load balancer, an implementation such as MetalLB (L2 or BGP mode) can fill that role; confirm operational fit with your platform team before treating it as production-default. |
| 👥 Application replicas | Scale stateless frontend/API/backend services to at least two replicas. Keep workflow and ETL-style services singleton unless Exivity confirms a scaling pattern for your workload. |
| 📆 Scheduling | Use pod anti-affinity or topology spread constraints where the platform supports it, so that replicas of the same service do not land on the same node. |
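The scheduling recommendation can be sketched with standard Kubernetes pod spec fields. This is an assumption-laden fragment: how the Exivity chart exposes scheduling settings per service is not covered in this document, and the label below is hypothetical.

```yaml
# Pod spec fragment using standard Kubernetes scheduling fields (sketch).
topologySpreadConstraints:
  - maxSkew: 1
    # Spread across nodes; use topology.kubernetes.io/zone to spread across zones.
    topologyKey: kubernetes.io/hostname
    # Prefer spreading, but do not block scheduling when it cannot be satisfied.
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        # Hypothetical label: match the labels your Exivity pods actually carry.
        app.kubernetes.io/name: glass
```

ScheduleAnyway keeps pods schedulable during a node outage; DoNotSchedule enforces the spread strictly at the cost of pending pods when capacity is short.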
Service replica guidance
The following is a conservative starting point. Scale after observing CPU, memory, queue depth, and report preparation behavior.
| Service | Starting replicas | Notes |
|---|---|---|
| glass | 2 | Stateless UI. |
| proximityApi | 2 | Stateless API; scale horizontally behind ingress. |
| edify, horizon, pigeon, transcript, use | 2 | Pull work from RabbitMQ queues (REPORT, BUDGET, PIGEON/WORKFLOW_EVENT/REPORT_PUBLISHED, TRANSFORM, and EXTRACT respectively). RabbitMQ delivers each queued job to one consumer, so multiple replicas distribute load and increase throughput. |
| chronos, executor, griffon, proximityCli | 1 | Must remain singletons. These services own scheduling, workflow dispatch, and CLI execution, where multiple replicas would duplicate work. |
RabbitMQ delivers each queued message to a single consumer, but it does not stop you from queueing the same logical task (the same extractor, transformer, or report for the same period) twice. Running the same task concurrently can produce overlapping writes to extracted, exported, or report, regardless of how many replicas a service has. This is a workflow-design concern, not a replica-count concern: design schedules and triggers so the same task for the same period is not enqueued in parallel.
Reference values: charts/exivity/examples/best-practice-multi-node.yaml
Scenario C: multi-site active/passive
For deployments spanning multiple physical sites, Exivity recommends active/passive. The active site runs the application and middleware. The passive site continuously receives replicated data and is promoted during a failover event.
Active/passive avoids the operational complexity of active/active PostgreSQL writes, RabbitMQ stretching, Longhorn stretching, and workflow execution conflicts.
Architecture
This table describes the role of each layer in the diagram above, comparing the active and passive sites side by side. It is descriptive, not prescriptive.
| Layer | Active site | Passive site |
|---|---|---|
| 🚦 Traffic routing | DNS, GSLB, or load balancer sends users to the active ingress. | Standby ingress is prepared for failover but does not receive normal traffic. |
| 🧩 Application tier | Exivity service replicas are greater than 0. | Exivity service replicas remain 0 until failover. |
| 🐘 PostgreSQL | Runs the primary database endpoint. | Receives replicated data or restores from validated backups before promotion. |
| 🐇 RabbitMQ | Runs an independent site-local RabbitMQ instance. | Runs a separate site-local RabbitMQ instance; RabbitMQ state is not replicated. |
| 💾 Storage | Uses site-local shared storage and backup replication. | Uses independent site-local storage restored or attached during failover. |
| ☸️ GitOps | Controls scaling, routing, and failover changes through versioned state. | Promotes the site through the same repeatable workflow. |
Configuration
This table lists the choices to make for a multi-site active/passive deployment. It is prescriptive.
| Decision | Recommended value |
|---|---|
| 🧩 Application | Run Exivity only in the active site. Keep passive-site application replicas at 0 until failover. |
| 🐘 PostgreSQL | Use active/passive PostgreSQL replication. For CloudNativePG, use a replica cluster or a supported backup/restore promotion pattern. |
| 🐇 RabbitMQ | Do not stretch RabbitMQ across sites. Deploy one site-local RabbitMQ instance per site. RabbitMQ state is not replicated between sites. |
| 💾 Longhorn / storage | Do not stretch a Longhorn cluster across sites. Use independent Longhorn clusters per site and replicate data through backups or storage-layer replication supported by your platform. |
| 🚦 DNS / load balancing | Use DNS, GSLB, or your load balancing platform to route users to the active site. |
| ☸️ Failover control | Use GitOps for repeatable failover. Argo CD with Argo Workflows or Argo Events is the preferred implementation pattern. |
Required GitOps failover pattern
A multi-site deployment must have a version-controlled, tested failover workflow. The workflow should perform the following actions in order:
- Mark Site A unavailable and stop routing new traffic to it.
- Scale Site A Exivity application replicas to 0 if the cluster is reachable.
- Promote the Site B PostgreSQL replica or restore the latest validated backup, depending on the PostgreSQL design.
- Ensure Site B RabbitMQ is available and configured for Exivity.
- Restore or attach the required Site B shared data volumes.
- Scale Site B Exivity application replicas above 0.
- Switch DNS, GSLB, or load balancer traffic to Site B.
- Run application validation checks before handing the service back to users.
If you do not have GitOps practices, implement this as a documented runbook, but understand that this is not the preferred operating model. For best-practice multi-site deployments, GitOps is required to reduce failover risk and make the process repeatable.
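The ordered failover actions could be encoded as an Argo Workflows definition along the following lines. This is a skeleton under stated assumptions, not a working failover: the workflow name, step names, and container image are hypothetical, and every step only prints a placeholder that you would replace with your own tooling (for example, a commit to the GitOps repository that changes replica counts).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exivity-failover-   # hypothetical name
spec:
  entrypoint: failover
  templates:
    - name: failover
      # Each "- -" entry is its own step group; groups run one after another.
      steps:
        - - name: stop-traffic-site-a
            template: run
            arguments:
              parameters: [{name: action, value: "mark Site A unavailable"}]
        - - name: scale-down-site-a
            template: run
            arguments:
              parameters: [{name: action, value: "scale Site A replicas to 0"}]
        - - name: promote-postgres-site-b
            template: run
            arguments:
              parameters: [{name: action, value: "promote or restore Site B PostgreSQL"}]
        - - name: check-rabbitmq-site-b
            template: run
            arguments:
              parameters: [{name: action, value: "verify Site B RabbitMQ"}]
        - - name: attach-storage-site-b
            template: run
            arguments:
              parameters: [{name: action, value: "restore or attach Site B volumes"}]
        - - name: scale-up-site-b
            template: run
            arguments:
              parameters: [{name: action, value: "scale Site B replicas above 0"}]
        - - name: switch-traffic
            template: run
            arguments:
              parameters: [{name: action, value: "switch DNS, GSLB, or load balancer to Site B"}]
        - - name: validate
            template: run
            arguments:
              parameters: [{name: action, value: "run application validation checks"}]
    # Placeholder template: replace the echo with the real command for each step.
    - name: run
      inputs:
        parameters:
          - name: action
      container:
        image: alpine:3.20   # assumption: any small image with a shell
        command: [sh, -c]
        args: ["echo 'TODO: {{inputs.parameters.action}}'"]
```

Keeping the sequence in one versioned definition means the order is reviewed once and rehearsed in DR tests, instead of being reconstructed during an incident.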
Reference values: charts/exivity/examples/best-practice-multi-site-active-passive.yaml
Active/active across sites
Active/active across sites is discouraged and should not be used as the default architecture.
The main concerns are:
| Concern | Impact |
|---|---|
| 🐘 PostgreSQL write conflicts | Bidirectional PostgreSQL replication is complex and can introduce conflict handling requirements that Exivity does not need in active/passive mode. |
| 📆 Workflow scheduling | Only one site should execute workflows unless there is a clear leader-election or workload partitioning design. Otherwise, work may be duplicated or events may not progress as expected. |
| 🐇 RabbitMQ stretching | RabbitMQ clusters should not be stretched across high-latency links for this use case. |
| 💾 Storage stretching | Longhorn should not be stretched across sites. Site-local storage is simpler and safer. |
| ⏱️ Latency | WAN latency to PostgreSQL can significantly affect report preparation and other database-heavy operations. |
If you need active/active, treat it as a custom architecture and involve Exivity engineering before committing to the design.
Middleware recommendations
The middleware products in this section are third-party dependencies. Exivity requires compatible database, message queue, storage, networking, backup, and monitoring services, but the operation and support of those services remains your responsibility or that of your chosen platform/vendor.
PostgreSQL
PostgreSQL is the most important stateful dependency. Production deployments should use external PostgreSQL rather than the embedded Bitnami dependency shipped with the chart.
Recommended options:
| Option | Recommendation |
|---|---|
| 🐘 Managed or standard PostgreSQL | Preferred where you already operate a supported HA PostgreSQL platform. |
| ☸️ CloudNativePG | Recommended for self-hosted PostgreSQL on Kubernetes. See the CloudNativePG documentation. |
| 🐘 Embedded Bitnami PostgreSQL | Acceptable for evaluation and small single-node deployments only. Not recommended for production HA. |
Starting recommendations:
| Setting | Recommendation |
|---|---|
| 💾 Storage | 25-50Gi minimum. Monitor and expand before reaching 70% utilization. |
| 🔁 Replication | Use active/passive HA within a site or across sites. |
| 🔄 Backups | Use PostgreSQL-native backups. For CloudNativePG, use Barman Cloud to S3-compatible object storage where available. |
| 🛡️ TLS | Use TLS for database traffic where supported by your platform. |
| 🌐 Latency | Keep Exivity and PostgreSQL in the same low-latency site for active workloads. WAN latency around 15ms or higher can materially affect report preparation. |
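For the CloudNativePG option, a minimal Cluster resource consistent with the storage and replication rows above might look like this. The resource name, namespace, and StorageClass are assumptions. Backup configuration is intentionally omitted because the mechanism depends on the CloudNativePG version in use.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: exivity-postgres     # hypothetical name
  namespace: exivity-db      # hypothetical namespace
spec:
  # One primary plus two replicas within the site.
  instances: 3
  storage:
    # Upper end of the 25-50Gi starting range.
    size: 50Gi
    # Assumption: replace with a StorageClass available in your cluster.
    storageClass: longhorn
```

CloudNativePG normally exposes the primary through a read-write Service named after the cluster with an -rw suffix. Confirm the service name and credentials secret in your cluster before pointing Exivity at it.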
RabbitMQ
Exivity uses RabbitMQ for transient application communication and work coordination, not as the primary system of record. Data integrity is primarily tied to PostgreSQL and shared data volumes. For this reason, a site-local in-cluster RabbitMQ deployment is the default recommendation. If RabbitMQ fails, Kubernetes can reschedule it, and interrupted work can be retried without introducing an external middleware dependency. External or managed RabbitMQ is optional when required by your platform standards.
The Exivity Helm chart's embedded RabbitMQ dependency is based on the Bitnami RabbitMQ Helm chart. Following the Bitnami container catalog changes on September 29, 2025, the chart consumes the unsupported bitnamilegacy/rabbitmq image through the Exivity-hosted Bitnami mirror. This is a temporary compatibility measure and will not receive updates or security patches.
Treat the embedded RabbitMQ as suitable only for evaluation and small single-node deployments. For production, keep RabbitMQ site-local but run it outside the Exivity chart, preferably with the RabbitMQ Cluster Operator.
How to run site-local RabbitMQ:
| Implementation | When to use | Notes |
|---|---|---|
| 🐇 RabbitMQ Cluster Operator | Production default for site-local RabbitMQ | Maintained upstream by the RabbitMQ team; uses official rabbitmq images; declarative RabbitmqCluster CRD; supported queue types and policies. |
| 🐇 RabbitMQ Messaging Topology Operator | Optional alongside the Cluster Operator | Lets you manage vhosts, users, queues, exchanges, bindings, and policies as Kubernetes resources. See the Messaging Topology Operator overview. |
| 🐇 Embedded chart dependency | Evaluation and small single-node only | Based on the Bitnami chart and bitnamilegacy image; do not treat as a long-term production architecture. |
| 🐇 Managed RabbitMQ | Optional, when required by platform standards | Not site-local; only choose this when you already operate a managed RabbitMQ platform. Connect Exivity through the external rabbitmq.host, rabbitmq.port, rabbitmq.vhost, and rabbitmq.secure values. |
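Connecting Exivity to a RabbitMQ instance that runs outside the chart might be sketched as follows. The four keys are the ones named in the table above; the host name and all values are assumptions. Credential settings, and any flag needed to turn off the embedded dependency, are not named in this document and are therefore left out.

```yaml
# Values fragment for an externally run RabbitMQ (sketch).
rabbitmq:
  # Hypothetical in-cluster service DNS name; use your RabbitMQ endpoint.
  host: exivity-rabbitmq.exivity.svc
  # 5672 is the standard AMQP port; 5671 is the conventional AMQP-over-TLS port.
  port: 5672
  vhost: "/"
  # Set to true when the endpoint requires TLS.
  secure: false
```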
Recommended options by scenario:
| Scenario | Recommendation |
|---|---|
| 🐇 Single-node | Use site-local in-cluster RabbitMQ. The embedded chart is acceptable for evaluation; prefer the RabbitMQ Cluster Operator (single node) for long-term deployments. |
| 🐇 Multi-node single-site | Use site-local in-cluster RabbitMQ via the RabbitMQ Cluster Operator. External or managed RabbitMQ is optional when required by your standards. |
| 🐇 Multi-site | Run one independent site-local RabbitMQ deployment per site. Do not stretch RabbitMQ across sites and do not replicate RabbitMQ state between sites. The RabbitMQ Cluster Operator is the preferred way to run each site-local deployment. |
Starting recommendations:
| Setting | Recommendation |
|---|---|
| 🐇 Clustering | Keep clustering disabled by default. Only enable clustering for a dedicated multi-node RabbitMQ design. |
| 🐇 Queues | Prefer quorum queues for new RabbitMQ designs where compatible with your RabbitMQ version and policy model. |
| 💾 Persistence | Use persistent storage for production RabbitMQ. |
| 📈 Monitoring | Monitor queue depth, memory, disk free space, and connection count. |
Confirm final RabbitMQ values against your chosen RabbitMQ deployment method before applying production tuning.
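A single-node, site-local RabbitMQ managed by the RabbitMQ Cluster Operator could be declared roughly as below. This assumes the operator is already installed in the cluster. The resource name, namespace, StorageClass, and volume size are illustrative.

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: exivity-rabbitmq    # hypothetical name
  namespace: exivity
spec:
  # Clustering stays disabled: a single RabbitMQ node per site.
  replicas: 1
  persistence:
    # Assumption: replace with a StorageClass available in your cluster.
    storageClassName: longhorn
    # Illustrative size, not an Exivity recommendation.
    storage: 10Gi
```

The operator typically creates a Service named after the resource and a default-user Secret holding generated credentials. Verify both in your cluster before wiring them into the Exivity values.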
Longhorn
Longhorn is the preferred storage provider for HA Exivity Kubernetes deployments when it is available in your environment. It is considered mature enough for production use and is generally more resilient than a standard in-cluster NFS provisioner in HA environments.
Recommended options:
| Scenario | Recommendation |
|---|---|
| 💾 Single-node | RWX is not required. Use a provisioner-backed local StorageClass (Docker Desktop hostpath, K3s local-path, or local-path-provisioner) with storage.sharedVolumeAccessMode: ReadWriteOnce. NAS/NFS is a valid alternative when you already operate reliable NAS or want storage decoupled from the node, but does not by itself make Exivity HA when Kubernetes is single-node. Longhorn is possible but provides limited HA value on one node because replicas cannot be spread across nodes. |
| 💾 Multi-node single-site | Prefer Longhorn with three replicas per volume. |
| 💾 Multi-site | Use one independent Longhorn deployment per site. Do not stretch Longhorn across sites. |
Starting recommendations:
| Setting | Recommendation |
|---|---|
| 💾 Replicas | Configure three replicas per volume for HA environments. |
| ☸️ Replica placement | Spread replicas across nodes and failure domains where possible. |
| 🔄 Backups | Configure recurring snapshots and recurring backups to an S3-compatible or otherwise approved backup target. |
| 📊 Capacity | Size disks for usable capacity after three-way replication and snapshot overhead. |
| 💾 RWX | Validate RWX behavior before production, including share-manager scheduling and failover. |
Confirm final Longhorn StorageClass and recurring job values against the current chart before applying production tuning.
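As a starting point for that confirmation, a Longhorn StorageClass with three replicas and a nightly backup job might look like the sketch below. Resource names, the schedule, and the retention count are assumptions, and the RecurringJob API version depends on the Longhorn release you run.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-exivity          # hypothetical name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain             # keep data if a PVC is deleted by mistake
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"           # three replicas per volume for HA
  staleReplicaTimeout: "30"
---
# Assumes a backup target is already configured in Longhorn.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: exivity-nightly-backup    # hypothetical name
  namespace: longhorn-system
spec:
  task: backup
  cron: "0 2 * * *"               # illustrative schedule: daily at 02:00
  retain: 7                       # illustrative retention: keep 7 backups
  concurrency: 2
  groups:
    - default                     # applies to volumes in the default group
```

With three replicas, each requested Gi consumes roughly three Gi of raw disk before snapshot overhead, which is why the capacity row above asks you to size for usable capacity.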
NFS
When NFS is used as RWX storage for Exivity, deploy the NFS Ganesha server and external provisioner (the nfs-server-provisioner Helm chart) rather than an unspecified NFS server. It serves NFSv4 with file locking, which Exivity requires, and is the reference NFS provisioner used in this documentation.
This in-cluster provisioner can work well for smaller or simpler deployments. For HA environments, prefer an external HA NAS platform that exposes NFSv4, because the in-cluster provisioner becomes a single point of failure unless its backing storage and node placement are explicitly designed for HA.
Operations checklist
Before production
| Check | Requirement |
|---|---|
| 💾 Storage | Storage class validated with Exivity PVCs: RWX for multi-node and multi-site deployments, RWO is sufficient for single-node. |
| 📶 PVC sizes | Production PVC sizes set explicitly. |
| 🐘 PostgreSQL | HA design, backups, restore, and monitoring validated. |
| 🐇 RabbitMQ | Connectivity, authentication, TLS, and monitoring validated. |
| 🚦 Ingress / load balancer | DNS, TLS certificate, ingress class, and trusted proxy behavior validated. |
| 🔐 Secrets | Production secret.appKey, secret.jwtSecret, PostgreSQL password, and RabbitMQ password configured. |
| 🔄 Backups | Restore test completed. |
| 📈 Monitoring | Cluster, application, PostgreSQL, RabbitMQ, ingress, and storage alerts configured. |
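One way to approach the storage check before installing Exivity is a throwaway RWX claim against the intended StorageClass. The claim name and StorageClass below are assumptions.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-smoke-test        # hypothetical name; delete after the test
  namespace: exivity
spec:
  accessModes:
    - ReadWriteMany
  # Assumption: replace with the StorageClass you plan to use for Exivity.
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
```

Mount the claim from two pods scheduled on different nodes and confirm that a file written by one is readable by the other. A claim that binds is not proof that RWX works across nodes.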
Day-2 operations
| Area | Recommendation |
|---|---|
| ⎈ Upgrades | Run Helm upgrades from version-controlled values. Back up PostgreSQL before upgrades. |
| 📊 Capacity | Monitor PostgreSQL, extracted, exported, and log volume growth. |
| 📄 Logs | Lower log retention first; expand log PVCs only if reduced retention still does not fit. |
| 🔁 DR | Test failover regularly for multi-site deployments. |
| 🔐 Security | Rotate credentials according to your security policy and keep images patched. |
Example values files
Use these example files as starting points, not as final production values:
| Scenario | File |
|---|---|
| Single-node | charts/exivity/examples/best-practice-single-node.yaml |
| Multi-node | charts/exivity/examples/best-practice-multi-node.yaml |
| Multi-site active/passive | charts/exivity/examples/best-practice-multi-site-active-passive.yaml |